Introduction

It is often noted that the United States has the highest per-capita incarceration rate in the developed world. Politicians across the political spectrum attribute this fact to various factors, causing widespread confusion. Furthermore, two major crime bills have been introduced in the past 40 years, one in 1984 (during the Reagan administration) and one in 1994 (during the Clinton administration). In this project, we investigate trends in the prison population over the past few decades and draw some conclusions about various potential drivers. For example, was there a particularly sharp change in the overall population after the implementation of one of these bills? How has the racial and gender breakdown of the prison population changed over time? Have certain states contributed more to the increase than others? By considering these questions, we hope to gain a more sophisticated understanding of the recent history of the US prison population.

Data Sources

We initially had three unrelated project ideas worth investigating, and split them up among the three of us to investigate data availability. Jamie was in charge of seeking data for a potential project on the US prison system. At first, he looked to the Bureau of Prisons, which required many phone calls to no avail. He was eventually directed to the Bureau of Justice Statistics, where he was able to find a large time series dataset. This dataset was thorough enough that we chose to commit to this project idea, and it served as the primary data source for the project going forward.

The dataset is presented in an untidy format: there is a column for year (1978-2016), a column for state, and about two hundred more columns with numeric values. In total, there are 2106 rows and 210 columns. These other columns represent many different potentially interesting slices of the prison population. For example, there are columns for each type of prison, each type of custody, every combination of sex and race, a few sentence types, etc.

One issue with the data is missing values. For example, we hoped to investigate private prisons, but these data were poorly reported, so we could not do much meaningful analysis on these populations. Another issue was the lack of total state population. We sought to standardize the incarceration rates over time by population, so we needed to find another source for these data. A final issue is the lack of tidiness, which took a toll on the efficiency and organization of our work with the data.

We obtained the population data from Google Public Data. Google displays the data in a line chart, so we wrote a script to crawl the underlying values. The resulting data and the crawling script are stored in our GitHub repository: https://github.com/TianyaoHan/EDAV_final

Missing Values

Prison Data Missing Data Analysis

Install and Load needed packages

Description of data

We mainly use two data sources.

One is the National Prisoner Statistics data from the United States Department of Justice (Office of Justice Programs, Bureau of Justice Statistics). The data are collected from each state for the years 1978 to 2016 and comprise 2106 records with 207 variables.

The other is population data for each state over the same years, so that we can normalize the results by population.

In addition, to be able to plot choropleth maps, we used the latitude and longitude data of the states as well as a lookup table from state abbreviations to state names (and vice versa). The prisoner data have only abbreviations, while the state population data have only full state names.
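To make the join and the normalization concrete, here is a minimal sketch in plain Python. It is not the project's own loading function: the name merge_with_population, the dictionary layout, the "ZZ" aggregate code, and every number are assumptions made up for illustration. It shows how an abbreviation lookup bridges the two tables and why rows without a matching state end up with a missing name and population.

```python
# Illustrative values only: none of these numbers come from the real datasets.
ABBR_TO_NAME = {"NY": "New York", "TX": "Texas"}

prison_rows = [
    {"year": 2016, "state": "NY", "inmates": 500},
    {"year": 2016, "state": "TX", "inmates": 1200},
    {"year": 2016, "state": "ZZ", "inmates": 9000},  # hypothetical aggregate row
]

# Population keyed by (full state name, year), mirroring a table of full names.
population = {("New York", 2016): 200_000, ("Texas", 2016): 300_000}


def merge_with_population(rows, abbr_to_name, population):
    """Left-join prison rows to population and add a rate per 100,000 residents."""
    merged = []
    for row in rows:
        name = abbr_to_name.get(row["state"])  # None if the code is not a state
        pop = population.get((name, row["year"]))  # None if there is no match
        rate = None if pop is None else row["inmates"] * 100_000 / pop
        merged.append(
            {**row, "state_name": name, "population": pop, "rate_per_100k": rate}
        )
    return merged


merged = merge_with_population(prison_rows, ABBR_TO_NAME, population)
for row in merged:
    print(row["state"], row["state_name"], row["rate_per_100k"])
```

Keeping unmatched rows (a left join) rather than dropping them makes it easy to see afterwards which records have no population counterpart.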

We created two functions: one loads the raw prisoner data and merges it with the state population, and the other splits the columns into categories and subsets them for further analysis, since the raw prisoner data have 207 columns in total. The latter also deals with the different missing types.

In this section we load the raw data and split them into 6 dataframes. We are most interested in the inmates_stats, race_info, and facility_info dataframes, which serve our overall analytic goal.

Missing Data Analysis

Base Columns

We can see in the graph above that state_name and Population are missing together. This is due to the extra entries added to the prison data (STATEID 60 for State prison total, STATEID 70 for US prison total (state+federal), and STATEID 99 for Federal BOP), which are not in the population table. REGION is missing for NE in 2014-2016 in the original data.
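One way to check which columns are missing together is to count rows by their set of missing columns. The sketch below is an assumption about how such a tally could be done in plain Python, not the method used to produce the graph; the function name missing_patterns and all row values are made up, apart from the STATEID codes 70 and 99 mentioned in the text.

```python
from collections import Counter


def missing_patterns(rows):
    """Count rows by the sorted tuple of columns whose value is None."""
    patterns = Counter()
    for row in rows:
        missing = tuple(sorted(col for col, value in row.items() if value is None))
        patterns[missing] += 1
    return patterns


# Made-up rows: one ordinary state and two aggregate entries without population.
rows = [
    {"STATEID": 1, "state_name": "Example State", "Population": 100, "REGION": 3},
    {"STATEID": 70, "state_name": None, "Population": None, "REGION": 3},
    {"STATEID": 99, "state_name": None, "Population": None, "REGION": 3},
]

patterns = missing_patterns(rows)
for pattern, count in patterns.most_common():
    print(pattern or "(complete)", count)
```

A pattern that always contains the same pair of columns, as state_name and Population do here, indicates that the two are missing for a shared reason.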

Inmates Statistics

PVINCLM and PVINCLF are missing together for one record. These columns indicate whether privately housed inmates are included in the total number of inmates under the state's jurisdiction. They are missing for both male and female in the 2016 Arkansas (AR) survey.

LFM and LFF are also missing together for one record. These columns give the number of inmates housed in local facilities (including local facilities under contract or other arrangement). They are missing for both male and female in the 2016 Oregon (OR) survey.

Race Table and Facility Table

No missing columns except the base columns mentioned above.

Re-processing data and fixing missing values

Investigating Different Missing Types

Missing values without a definite survey reason were analyzed and fixed in the previous steps. The raw data also encode four types of missing value as negative numbers: -9, -8, -2, and -1. The meaning of each is described below:

  • -9: Data are missing because the state did not respond to the item

  • -8: Data are missing because the item was not applicable to the state

  • -2: Item was asked, but only in the aggregate prison population, not by male or female

  • -1: Item not asked in survey for this year
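As a rough illustration of how these codes can be handled, the following plain-Python sketch replaces each sentinel with None while tallying how often each code occurs per column. The function name recode_missing, the key-column names, and the row values are assumptions for illustration; only the four codes and their meanings come from the description above.

```python
from collections import Counter

# Sentinel codes and their meanings, as listed in the text.
MISSING_CODES = {
    -9: "state did not respond",
    -8: "not applicable to the state",
    -2: "asked only in aggregate, not by male or female",
    -1: "not asked in survey for this year",
}


def recode_missing(rows, key_columns=("year", "state")):
    """Return (cleaned_rows, counts); counts[(column, code)] tallies each sentinel."""
    counts = Counter()
    cleaned = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col not in key_columns and value in MISSING_CODES:
                counts[(col, value)] += 1
                new_row[col] = None  # plays the role of NA
            else:
                new_row[col] = value
        cleaned.append(new_row)
    return cleaned, counts


# Made-up rows using two column names that appear in the survey.
raw = [
    {"year": 2016, "state": "AR", "LFCROWDM": -9, "CAPDEST": 1000},
    {"year": 1980, "state": "OR", "LFCROWDM": 15, "CAPDEST": -1},
]

cleaned, counts = recode_missing(raw)
for (col, code), n in sorted(counts.items()):
    print(col, code, MISSING_CODES[code], n)
```

Tallying before recoding preserves the reason for each gap, which is lost once every code becomes the same missing marker.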

State did not respond to the item (-9)

inmates_stats

The columns with the most rows missing because the state did not respond are LFCRINCM and LFCRINCF (the number of State inmates housed in local facilities solely to ease crowding, for male and female). The second most missing set of columns is LFCROWDF and LFCROWDM (answers to "How many inmates were housed in local facilities operated by a county or other local authority solely to ease prison crowding?"). Next come FACROWDF and FACROWDM (the number of inmates housed in another State or in a Federal prison because there was no room for them in state correctional facilities).

This suggests that states most often failed to respond to questions about crowding, and that answers to such questions are less complete for females than for males.

There is also a block of missing values for questions about whether certain counts are included in the jurisdiction total, for example:

  • FACINCLF and FACINCLM (are inmates housed in federal or other states' facilities included in the total)

  • LFINCLM and LFINCLF (are inmates housed in local facilities included in the total)

  • PVINCLM and PVINCLF (are inmates housed in private facilities included in the total)

These are usually missing together because the state did not respond to them.

race_info

Race columns that are missing because the state did not respond are ranked below:

  1. ADDRACEM and ADDRACEF (Additional race categories)

  2. NHPIM and NHPIF (Native Hawaiian or other Pacific Islander)

  3. HISPF and HISPM (Hispanic or Latino)

In certain years, many states may have been unable to answer the questions on additional race categories, Native Hawaiian or other Pacific Islander inmates, or Hispanic inmates, or may not have had any inmates in those categories.

facility_info

  • CAPRATET - Rated capacity (The number of beds or inmates assigned by rating officials to institutions within your jurisdiction) total

  • CAPRATEF - Rated capacity (The number of beds or inmates assigned by rating officials to institutions within your jurisdiction) female

  • CAPDEST - Design capacity (The number of inmates that planners or architects intended for all institutions within your jurisdiction) total

  • CAPRATEM - Rated capacity (The number of beds or inmates assigned by rating officials to institutions within your jurisdiction) male

  • CAPDESF - Design capacity (The number of inmates that planners or architects intended for all institutions within your jurisdiction) female

  • CAPDESM - Design capacity (The number of inmates that planners or architects intended for all institutions within your jurisdiction) male

  • CAPOPT - Operational capacity (The number of inmates that can be accommodated based on staff, existing programs, and services in institutions within your jurisdiction) total

  • CAPOPF - Operational capacity (The number of inmates that can be accommodated based on staff, existing programs, and services in institutions within your jurisdiction) female

  • CAPOPM - Operational capacity (The number of inmates that can be accommodated based on staff, existing programs, and services in institutions within your jurisdiction) male

A recurring pattern is that the male and female rated capacities exist but the total rated capacity does not. The total is easily calculated by adding male and female together, but the states did not answer that column. The same happened for design capacity and operational capacity. In the further analysis, we decided to add female and male together to bypass the missing totals.

Item not applicable to the state (-8)

Jurisdiction unsentenced counts are the items most often not applicable to some states in some years. Privately held inmate counts are the second most often not applicable. Few race values are missing for this reason, meaning most race categories are applicable to most states in most years. Again, design capacity in some years is missing for the total but not for females and males individually, due to applicability.

Item asked only in the aggregate prison population, not by male or female (-2)

Most states have female and male split data for most columns except the custody-related columns. Race has full information for both females and males. Facility data are either available both split and aggregated or not available at all.

Item not asked in survey for this year (-1)

inmates_stats

  • CUSGT1T (custody with maximum sentence greater than 1 year, total (1978-1982 ONLY))

  • CUSLT1T (custody with maximum sentence 1 year or less, total (1978-1982 ONLY))

  • CUSUNST (custody unsentenced, total (1978-1982 ONLY))

  • CUSTOTT (total under custody, total (1978-1982 ONLY))

These columns are all marked 1978-1982 ONLY and are not asked (missing) at the same time, indicating that for 1978-1982 no custody information is available because these items were not asked in those years.

race_info

In certain years, certain sets of race categories were not asked.

facility_info

  • CAPRATET - Rated capacity

  • CAPOPT - Operational capacity

  • CAPDEST - Design capacity

CAPRATET, CAPOPT, and CAPDEST are missing the most because they were not asked in some years, and they tend to be missing at the same time. The female and male splits were also not asked in some years, along with the total.

finalized dataframe

Based on the analysis above, along with the nature of the data and some definitions, we finalize the dataframe to use for further analysis.

(filled missing values and replaced missing value indicators with NAs)

Data Transformation

As seen in the exploration above, most survey items are split into female, male, and sometimes total. Hence, we created two functions to apply two types of data transformation.

The stack_mft function identifies the male, female, and total columns for each question, splits the dataframe by these columns into male, female, and total dataframes, and then stacks the three together with an indicator column taking the values F, M, and T. This transforms the dataframe from wide to long.
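The wide-to-long reshaping can be sketched in plain Python as follows. This is an illustrative stand-in, not the project's stack_mft: it assumes that every question column ends in M, F, or T (as in CAPDESM, CAPDESF, CAPDEST), that "year" and "state" are the identifier columns, and it uses made-up values. A column that happens to end in one of those letters without being a sex split would be misclassified by this simple rule.

```python
def stack_mft(rows, id_columns=("year", "state")):
    """Reshape wide rows to long: one output row per input row and sex code."""
    long_rows = []
    for row in rows:
        # Collect values per sex code, keyed by the column stem (name minus suffix).
        by_sex = {"M": {}, "F": {}, "T": {}}
        for col, value in row.items():
            if col in id_columns:
                continue
            stem, suffix = col[:-1], col[-1]
            if suffix in by_sex:
                by_sex[suffix][stem] = value
        for sex, values in by_sex.items():
            if values:  # skip sex codes with no columns in this row
                ids = {key: row[key] for key in id_columns}
                long_rows.append({**ids, "sex": sex, **values})
    return long_rows


# Made-up wide row with male, female, and total design capacity.
wide = [{"year": 2016, "state": "NY", "CAPDESM": 90, "CAPDESF": 10, "CAPDEST": 100}]

long_rows = stack_mft(wide)
for row in long_rows:
    print(row)
```

Each input row becomes up to three output rows that share the same stem column (here CAPDES), so plots can group or facet on the sex indicator directly.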

The sum_FM function takes a slightly different approach: it sums the female and male values for each question to get a total that is comparable to the state population.

On top of that, we are also interested in the geospatial clustering of the prison data and trends. We use the state-level geo data from the maps package for this purpose.

Since the study focuses on inmates under state jurisdiction rather than under custody, we selected all the columns related to each state's jurisdiction inmate count. The jurisdiction count is also more reliable because it accounts for locally and privately housed inmates.
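A minimal plain-Python sketch of the summing idea is shown below. It is an assumption about how such a function might look, not the project's sum_FM: the name sum_fm, the M/F suffix rule, the identifier columns, and the values are all illustrative. One deliberate choice here is that a missing male or female value yields a missing total rather than being treated as zero.

```python
def sum_fm(rows, id_columns=("year", "state")):
    """For every column pair <stem>M / <stem>F, emit <stem> = male + female."""
    summed = []
    for row in rows:
        out = {key: row[key] for key in id_columns}
        for col, male in row.items():
            if col in id_columns or not col.endswith("M"):
                continue
            stem = col[:-1]
            if stem + "F" not in row:
                continue  # no matching female column, so nothing to sum
            female = row[stem + "F"]
            # Propagate missingness instead of silently counting it as zero.
            out[stem] = None if male is None or female is None else male + female
        summed.append(out)
    return summed


# Made-up values; None stands for a missing entry.
wide = [
    {"year": 2016, "state": "NY", "CAPDESM": 90, "CAPDESF": 10},
    {"year": 2016, "state": "TX", "CAPDESM": 70, "CAPDESF": None},
]

totals = sum_fm(wide)
for row in totals:
    print(row)
```

Propagating missing values keeps an incomplete total from looking like a genuinely small one when it is later divided by the state population.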

Data Transformation - inmates study

  • Transform the inmate stats dataframe to desired formats

  • Add geospatial data and plot

[Figures: Total number of inmates under your jurisdiction by year; Total number of inmates housed in private facilities by year]

Data Transformation - inmate race study

  • Transform the race info dataframe to desired format

  • Add geospatial data and plot

Data Transformation - facility study

  • Transform the facility info dataframe to desired format

  • Add geospatial data and plot

Above is an example usage of the reshaped data.

Results

The US federal and state prison population increased sharply from the 1970s to around 2010, decreasing modestly thereafter.

Vacancy Rates

One symptom of heightened incarceration rates is overcrowding in prisons. We can see in the graph below that many states, in 2016, had more prisoners under jurisdiction than their design-specified capacity. In fact, some states hold approximately double the originally intended capacity of their prisons. This leads to overcrowding of current prisons or increased funding of new public or private prisons.
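The comparison between inmates under jurisdiction and design capacity can be expressed as a simple ratio, where a value above 1 means a state holds more inmates than its prisons were designed for. The sketch below is illustrative only: the function name occupancy_ratio, the field names, and all figures are made up and do not reproduce the project's results.

```python
def occupancy_ratio(inmates, design_capacity):
    """Return inmates / design capacity, or None when either value is unusable."""
    if inmates is None or design_capacity is None or design_capacity <= 0:
        return None
    return inmates / design_capacity


# Made-up figures for three hypothetical states.
states = [
    {"state": "A", "inmates": 2000, "design_capacity": 1000},
    {"state": "B", "inmates": 900, "design_capacity": 1000},
    {"state": "C", "inmates": 500, "design_capacity": None},  # capacity not reported
]

ratios = {
    s["state"]: occupancy_ratio(s["inmates"], s["design_capacity"]) for s in states
}
over_capacity = sorted(
    name for name, ratio in ratios.items() if ratio is not None and ratio > 1
)

print(ratios)
print("Over design capacity:", over_capacity)
```

A ratio of 2 corresponds to a state holding double its design capacity, and the vacancy rate is simply one minus this ratio.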

Closing Remarks

It appears that the increase in prison population is driven both by federal policies, such as the crime bills of 1984 and 1994, and by state-specific policy. We would expect to have seen a more even increase in per-capita prison populations across states if the primary driver were federal legislation, but the evidence is clear that this is not the case. However, since all states have seen increases in per-capita prison population, it appears that federal legislation plays a role. Regardless of cause, the evidence is clear that the prison system is currently bloated, demanding either more funding or fewer prisoners.

Interactive Component

Interactive Racial Data

In the Demographic Trends Analysis section, we analyze how the proportions of different races changed over the past few decades across the whole country. This Shiny app breaks the analysis down to the state level. You can see how the proportion of inmates of different races changed over the past few decades and how the proportions differ between states.

You can play with our shiny app at https://tianyaohan.shinyapps.io/race_analysis/

Conclusion

We ran into a few limitations during the project that made things more difficult. First off, there is no clear-cut story for something as vast and nuanced as the US Prison System, beyond just saying that the size of the population has grown. Nonetheless, there are many interesting trends to observe, as mentioned in the Results section. There are two main things that could be done for future steps. The first is to obtain data on topics in more detail, such as counts for type of crime or size of private prisons. The second would be to turn an eye toward state-level policy and dig into the weeds of which policies were implemented to drive higher growth in certain states. These two steps proved nebulous to us given the scarcity of good data and the vastness of state-level legislation. The biggest lesson we learned is that data is messy in the real world. The data available to you is often not quite robust enough to do exactly what was originally planned. However, there is often a story to be found nonetheless based on what is available. We believe that we gained proficiency at playing to the advantages of the data available.